Linux Journal December 2011 by Linux Journal

Author: Linux Journal
Language: eng
Format: mobi, epub
Tags: SQS, Mercurial, Linux, EFI, Databases, OpenRISC, Netfilter
Publisher: Belltown Media
Published: 2011-11-29T08:00:00+00:00


require 'rubygems'
require 'json'
require 'net/http'

jobname = ARGV[0]
hostname = ARGV[1]

url = URI.parse("http://#{hostname}/job/#{jobname}/lastBuild/api/json")
res = JSON.parse(Net::HTTP.get_response(url).body)
lastResult = res["result"]

if lastResult == "SUCCESS"
  puts "OK|Status=0"
  exit(0)
else
  failurl = URI.parse("http://#{hostname}/job/#{jobname}/api/json")
  failres = JSON.parse(Net::HTTP.get_response(failurl).body)
  health = failres["healthReport"][0]["description"]
  puts "Job #{jobname} broke: #{health}"
  exit(1)
end

The monitoring system calls the code with command-line parameters of the name of the job and the name of the host. The code then looks for the result from the Hudson server and checks for success. The return value and exit code are how the monitoring script replies to the monitoring system. A nonzero exit code indicates a failure, and the return value is a string that the system displays as the reason for the failure. On Zenoss, this is also used in deduplication. On success, the monitoring script has an exit code of 0 with a string returned in a special form for the system to process (see code).
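This reply contract can be sketched as a small helper; the function and job name below are hypothetical, not from the article:

```ruby
# Hypothetical helper illustrating the reply contract: on success, exit code 0
# and a status string in the form the monitoring system parses; on failure, a
# nonzero exit code and a human-readable reason (which Zenoss also uses for
# deduplication).
def check_output(result, jobname, detail = nil)
  if result == "SUCCESS"
    ["OK|Status=0", 0]
  else
    ["Job #{jobname} broke: #{detail}", 1]
  end
end

message, code = check_output("SUCCESS", "build-app")
# message is "OK|Status=0" and code is 0
```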

Using this structure, system administrators can work with developers to build custom URLs that the monitoring system can access to determine the health of the application without worrying about every system in the set.
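One hypothetical shape for such a custom health URL's response, and how a check script might consume it (the field names are illustrative, not from the article):

```ruby
require 'json'

# Hypothetical JSON body a custom /health URL might return; in practice this
# would come from Net::HTTP.get_response as in the script above.
body = '{"status":"ok","db":"up","queue_depth":3}'

health = JSON.parse(body)
# The application is healthy only if it reports "ok" and its backlog is small
# (the 100-job limit here is a made-up threshold).
healthy = health["status"] == "ok" && health["queue_depth"] < 100
```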

It may seem hard to swallow that it’s acceptable to leave a box down overnight. It may be the first in a cascading series of failures that cause multiple servers to go down, eventually resulting in a downed service, but this can be addressed directly from the load balancer or front-end appliance instead of indirectly looking at the boxes themselves. Using this method, the alert can be set to go off after a certain number of boxes fail at certain times of day, and there is no need to solve harder problems, such as requiring each box to know the state of the entire cluster.
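The threshold idea above can be sketched as a simple policy function; the percentages and business-hours window are made-up values, not from the article:

```ruby
# Hypothetical cluster-level alert policy: page only when enough boxes are
# down, with a stricter threshold during business hours than overnight.
def should_alert?(failed_boxes, total_boxes, hour)
  threshold = (9..17).cover?(hour) ? 0.25 : 0.5  # fraction of cluster down
  failed_boxes.to_f / total_boxes >= threshold
end

should_alert?(2, 10, 3)   # 20% down overnight -> false, let it wait
should_alert?(3, 10, 11)  # 30% down mid-morning -> true, wake someone up
```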

So far, the design for the systems has been fairly agnostic with regard to geography and cloud footprint. For most applications, this doesn’t make much difference. Usually, with multiple geographies, each data center has its own instance of the monitoring system, with each one monitoring its siblings in the other locations. Operating in the cloud offers greater flexibility. Although it still is necessary to monitor the monitoring system, this can be done easily using Amazon’s great, but far less configurable, system to monitor Nagios or Zenoss EC2 instances.

What really stands out about Amazon’s cloud is that it’s elastic. Hooking up the EC2 command-line programs to the monitoring service will allow new boxes to be launched if some are experiencing problems due to resource starvation, load or programs crashing on the box. Of course, this needs to be kept in check, or the number of instances could spiral out of control, but within reasonable bounds, launching new instances in place of crashing or overloaded ones from inside of a monitoring script is relatively easy.

Here is an example of a script that monitors the load of a Hadoop cluster and adds more boxes as the number of jobs running increases:

#!/bin/bash
# Call as:
# increase_amazon_set.sh ${threshold} ${AMI}
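The full script was distributed as a download with the original article, so its body is not reproduced here. As a rough sketch of how such a script might work, assuming the classic EC2 API tools (ec2-run-instances) and the Hadoop CLI are installed and configured (the threshold, AMI ID and exact command output format are placeholders):

```shell
#!/bin/bash
# Hypothetical sketch of increase_amazon_set.sh -- not the author's actual
# script. Call as: increase_amazon_set.sh ${threshold} ${AMI}

threshold=${1:-10}      # launch a new box when running jobs exceed this
ami=${2:-ami-00000000}  # placeholder AMI ID

# Decide whether to scale: echoes "yes" if the job count exceeds the limit.
should_scale() {
    local jobs=$1 limit=$2
    if [ "$jobs" -gt "$limit" ]; then echo yes; else echo no; fi
}

# Count running Hadoop jobs; "hadoop job -list" prints summary/header lines
# before the job rows, so the exact lines to skip may differ per version.
jobs=$(hadoop job -list 2>/dev/null | tail -n +2 | grep -c .)

if [ "$(should_scale "$jobs" "$threshold")" = yes ]; then
    # Start one more worker from the given AMI (hypothetical invocation).
    ec2-run-instances "$ami" -n 1 2>/dev/null
fi
```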




